
---
title: "Generalised linear models"
description: "Estimating binary responses, counts, rates, overdispersion, and zero-inflation"
author:
  - name: Joseph Bulbulia
    url: https://josephbulbulia.netlify.app
    affiliation: Victoria University of Wellington
    affiliation_url: https://www.wgtn.ac.nz
    orcid_id: 0000-0002-5861-2056
date: 2021-APRIL-27
output:
  distill::distill_article:
    self_contained: false
    toc: true
    highlight: kate
bibliography: bib.bib
---

Objectives

  1. To understand the concept of a “generalised linear model”, and why the standard (gaussian) linear model is a special case.
  2. To understand how to write four classes of generalised linear models: logistic (binary responses), Poisson (counts), negative binomial (overdispersed counts), and zero-inflated models.
  3. To understand how to interpret and report the results of a generalised linear model.

Introduction

Recall that a regression model combines data from a sample, on the one hand, with the tools of probability theory, on the other hand, to infer features of an unobserved population. We call these unobserved features “parameters.” The task in regression is to estimate uncertainty in these parameters, and where possible, to narrow the bandwidth of such uncertainty. In a nutshell, regression is educated guesswork.

Our focus has been on teaching workflows for estimation that are grounded in three imperatives:

  1. to clarify our question: what do we want to infer?
  2. to clarify our assumptions: what do we believe about the world and our data before estimation gets going?
  3. to clarify our decisions: applied regression modellers inevitably must make choices: what have we decided, and why?

So far, our models have assumed that the response data are sampled from a “normal” or “gaussian” distribution. This assumption is useful because we live in an ordered universe in which many of our observations are approximately normally distributed. However, although the gaussian distribution is a sensible choice for estimating many parameters, it is not a sensible choice for estimating all parameters. The linear regression model that we have been using is really a special case of the generalised linear model. Today we introduce the generalised linear model.
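To see why the standard model is a special case, note that fitting a gaussian GLM with an identity link recovers the ordinary linear model. A minimal sketch with simulated data (the seed and coefficients here are arbitrary):

# the ordinary linear model is a glm with a gaussian family and identity link
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)

coef(lm(y ~ x, data = d))
coef(glm(y ~ x, family = gaussian(link = "identity"), data = d)) # identical estimates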

Logistic regression

We begin with outcomes that are distributed as binary data. Consider home ownership in the jittered nz dataset:

hist(nz$HomeOwner)

From the graph, we can see that the response distribution is bimodal: every response is either 0 or 1 (excluding the NAs).

Estimating the probability of home ownership in the population

We can write a generalised linear model that estimates the probability of home ownership as follows:

home <- glm(HomeOwner ~ 1, data = nz, 
            family = "binomial")

This gives us the following result:

parameters::model_parameters(home)
## Parameter   | Log-Odds |   SE |       95% CI |     z |      p
## -------------------------------------------------------------
## (Intercept) |     1.46 | 0.05 | [1.36, 1.55] | 30.38 | < .001

Or plugging this into the equation:

# this is how to quickly generate the equation
equatiomatic::extract_eq(home,  use_coefs = TRUE)

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.46 \]

Interpretation

How do we interpret this result?

We can use the plogis function, the inverse logit, to convert the log-odds intercept into a probability:

plogis(coef(home)[[1]])
## [1] 0.811068

The probability of an NZAVS participant owning their own home is 0.811068.

Notably, the logistic regression has recovered the sample mean to seven decimal places:

mean(nz$HomeOwner, na.rm=TRUE)
## [1] 0.811068
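As a quick check of the two scales, base R’s qlogis() is the inverse of plogis():

qlogis(0.811068) # the probability on the log-odds scale: ~ 1.46, the intercept
plogis(1.46)     # and back to the probability scale: ~ 0.81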

Logistic regression with a single covariate

Workflow: center and scale continuous predictors

We do this as follows:

# workaround for a compatibility issue between ggeffects and dplyr
nz['Household.INC_s'] <- as.data.frame( scale(nz$Household.INC) )

To get a reference point for a standard deviation of household income, let’s find the standard deviation for the sample:

# how much is a standard deviation -- which will represent a 1 unit change for the regression coefficient? 
sd( nz$Household.INC, na.rm = TRUE ) # 95,090
## [1] 95090.79

Let’s also ask: what was the mean household income in 2018?

mean( nz$Household.INC, na.rm = TRUE )
## [1] 112877.1

Model syntax

We write a logistic regression model with a single covariate as follows:

home2 <- glm(HomeOwner ~ Household.INC_s, data = nz, 
            family = "binomial")

Results:

rs2<-parameters::model_parameters(home2)
rs2
## Parameter       | Log-Odds |   SE |       95% CI |     z |      p
## -----------------------------------------------------------------
## (Intercept)     |     1.58 | 0.05 | [1.48, 1.69] | 29.37 | < .001
## Household.INC_s |     0.72 | 0.08 | [0.57, 0.89] |  8.86 | < .001
plot(rs2)

# this is how to quickly generate the equation
equatiomatic::extract_eq(home2,  use_coefs = TRUE)

\[ \log\left[ \frac { \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{HomeOwner} = \operatorname{1} )} } \right] = 1.58 + 0.72(\operatorname{Household.INC\_s}) \]

Interpretation

How do we interpret this model?

Let’s use the report package:

report::report(home2)
## We fitted a logistic model (estimated using ML) to predict HomeOwner with Household.INC_s (formula: HomeOwner ~ Household.INC_s). The model's explanatory power is weak (Tjur's R2 = 0.04). The model's intercept, corresponding to Household.INC_s = 0, is at 1.58 (95% CI [1.48, 1.69], p < .001). Within this model:
## 
##   - The effect of Household.INC_s is significantly positive (beta = 0.72, 95% CI [0.57, 0.89], p < .001; Std. beta = 0.73, 95% CI [0.57, 0.89])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using

The automated report mentions p-values and explanatory power, but what do these words practically mean?
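One practical answer is to push the coefficients back through the inverse logit. A quick sketch using the estimates reported above (values rounded):

# predicted probability of home ownership at -1, 0, and +1 SD of household income
plogis(1.58 + 0.72 * c(-1, 0, 1))
## approximately 0.70, 0.83, 0.91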

Workflow: graph your results

We are cultivating a habit of graphing our results. Let’s put this habit to virtuous use.

Here’s a shortcut for obtaining a quick, investigative plot:

plot(nz$Household.INC_s, nz$HomeOwner) 
# invlogit() comes from the arm package; base R's plogis() is equivalent
curve(invlogit(coef(home2)[1] + coef(home2)[2]*x), add=TRUE)

The slightly longer syntax for a ggeffects graph is:

plot(ggeffects::ggpredict(home2, terms = "Household.INC_s"), add.data = TRUE, alpha =.1)

And here we find a curious feature of our data. Income is diffuse: one participant earns over 25 standard deviations more than the average income.

hist(nz$Household.INC, breaks = 1000)

Although we graphed home ownership, we forgot to graph household income.

Before launching into a model, your workflow should include graphing your predictors as well as your responses.

The range of incomes is 0 to 2,500,000.

Note that we also have 8 participants who report a household income below 10,000.

Sensitivity to outliers?

Let’s re-run the model, eliminating the extremely rich and restricting focus to 98% of the data. This restriction seems justified: it would not be surprising if the very rich owned their own homes.

# retain ~98% of cases
nz2 <-  nz%>%
  dplyr::filter(Household.INC_s < 4)
##
nrow(nz2)/nrow(nz)
## [1] 0.979438
home2.1 <- glm(HomeOwner ~ Household.INC_s, data = nz2, 
            family = "binomial")
parameters::model_parameters(home2.1)
## Parameter       | Log-Odds |   SE |       95% CI |     z |      p
## -----------------------------------------------------------------
## (Intercept)     |     1.60 | 0.05 | [1.50, 1.71] | 29.21 | < .001
## Household.INC_s |     0.78 | 0.08 | [0.62, 0.95] |  9.27 | < .001

Note that the trimmed data now have a sensible range: no value is more than 4 SD from the mean.

hist(nz2$Household.INC_s)

range(nz2$Household.INC_s, na.rm = T)
## [1] -1.187046  3.755599

Does this matter? Not necessarily. Linear regression does not require that the predictors are sampled from a normal distribution. This should be obvious, as we have already used categorical predictors (e.g. Male/Not_Male), but it is worth stating explicitly: I have heard people who should know better express confusion on this point. Of course, the absence of parametric assumptions for your predictor variables does not let you off the hook. In this case, the extreme values might be distorting our inference.

Let’s graph the model fitted to the full dataset alongside the model fitted to the diminished dataset.

#range(nz$Household.INC_s, na.rm = T)

mp2 <- plot(ggeffects::ggpredict(home2,
                                 terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(limits = c(-1.2, 4))
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
                                   terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) +
  scale_x_continuous(limits = c(-1.2, 4))

library(patchwork)
mp2 + mp2.1 + 
  plot_annotation(title = "Comparison of logistic regression models with transformations",
                              tag_levels = 'a') 

Notice the trick that I used here:

scale_x_continuous(limits = c(-1.2, 4))

This code allowed me to constrain both graphs to the same x-axis scale. If I left this out the graphs would have looked like this:

#range(nz$Household.INC_s, na.rm = T)

mp2 <- plot(ggeffects::ggpredict(home2,
                                 terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) 
mp2.1 <- plot(ggeffects::ggpredict(home2.1,
                                   terms = "Household.INC_s[all]")) + scale_y_continuous(limits = c(0, 1)) 

library(patchwork)
mp2 + mp2.1 + plot_annotation(title = "Comparison of logistic regression models with transformations",
                              tag_levels = 'a') +
  plot_layout(guides = 'collect')

The models are, for all intents and purposes, identical.

You can perform further checks using the following code; however, it should already be apparent from the preceding graph that the two models do not differ:

check1 <- performance::check_model(home2)
check1
check2 <- performance::check_model(home2.1)
check2

Before leaving this example, what would an ordinary linear regression sampling from a normal distribution have returned?

home2LM <- lm(HomeOwner ~ Household.INC_s, data = nz)
home2LM_r <-parameters::model_parameters(home2LM)
home2LM_r
## Parameter       | Coefficient |       SE |       95% CI | t(2796) |      p
## --------------------------------------------------------------------------
## (Intercept)     |        0.81 | 7.27e-03 | [0.80, 0.83] |  111.88 | < .001
## Household.INC_s |        0.06 | 7.23e-03 | [0.04, 0.07] |    8.04 | < .001
plot(home2LM_r)

Let’s graph the results and compare the two models.

home2LM_p <- plot(ggeffects::ggpredict(home2LM,
                                   terms = "Household.INC_s[all]"),
                  add.data = TRUE) + 
  scale_y_continuous(limits = c(0, 1.5))  +
  scale_x_continuous(limits = c(-1.2, 4))+ theme_classic()
home2LM_p


home2_p <- plot(ggeffects::ggpredict(home2,
                                   terms = "Household.INC_s[all]"),
                  add.data = TRUE) + 
  scale_y_continuous(limits = c(0, 1.5))  +
  scale_x_continuous(limits = c(-1.2, 4))

home2LM_p + home2_p + plot_annotation(subtitle = "Comparison of ordinary least squares regression (a) \nwith logistic regression (b) reveals \nimpossible predictions for OLS") +
  plot_layout(guides = "collect")

Logistic regression with a categorical covariate

Let’s use the GenCohort variable.

mg1 <- glm(HomeOwner ~ GenCohort, data = nz, family = "binomial")
parameters::model_parameters(mg1)
## Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------------------
## (Intercept)                             |     2.13 | 0.10 | [ 1.94,  2.34] |  20.68 | < .001
## GenCohortGen_Silent *  born< 1946       |     0.26 | 0.28 | [-0.26,  0.85] |   0.94 | 0.347 
## GenCohortGenX *  born >=1961 & b.< 1980 |    -0.62 | 0.13 | [-0.87, -0.38] |  -4.90 | < .001
## GenCohortGenY *  born >=1980 & b.< 1996 |    -1.87 | 0.14 | [-2.15, -1.59] | -13.00 | < .001
## GenCohortGenZ *  born >= 1996           |    -4.91 | 1.04 | [-7.80, -3.30] |  -4.74 | < .001

What does this mean? Let’s graph the results

p_mg1 <-plot(ggeffects::ggpredict(mg1, terms = "GenCohort[all]"))
p_mg1

Let’s stratify by income (Household.INC_s):

mg2 <- glm(HomeOwner ~ GenCohort + Household.INC_s, data = nz, family = "binomial")
parameters::model_parameters(mg2)
## Parameter                               | Log-Odds |   SE |         95% CI |      z |      p
## --------------------------------------------------------------------------------------------
## (Intercept)                             |     2.60 | 0.12 | [ 2.37,  2.84] |  21.79 | < .001
## GenCohortGen_Silent *  born< 1946       |     0.70 | 0.30 | [ 0.15,  1.33] |   2.35 | 0.019 
## GenCohortGenX *  born >=1961 & b.< 1980 |    -0.99 | 0.14 | [-1.26, -0.73] |  -7.31 | < .001
## GenCohortGenY *  born >=1980 & b.< 1996 |    -2.40 | 0.16 | [-2.71, -2.09] | -14.94 | < .001
## GenCohortGenZ *  born >= 1996           |    -4.99 | 1.06 | [-7.91, -3.32] |  -4.72 | < .001
## Household.INC_s                         |     1.19 | 0.10 | [ 0.99,  1.39] |  11.62 | < .001

We really don’t see income making a difference to home ownership for boomers. A little separation appears among those in Gen X.

p_mg2 <-plot(ggeffects::ggpredict(mg2, terms = c("GenCohort[all]", "Household.INC_s[c(-1,0,3)]")))
p_mg2

Notes

  • Additive effects on the logit scale are non-linear on the data (probability) scale. We see this in the previous graph: there’s a curve. This is typical of generalised linear models.

  • do not interpret the signs of the coefficients as signs of probabilities: a negative coefficient does not imply a negative probability. For example, plogis(-3) is 0.0474259, which is a positive probability.

  • there is no error term (\(\sigma^2\)) in logistic regression. We only estimate the mean; the variance cannot be separately estimated:

sjPlot::tab_model(home2)

                  HomeOwner
Predictors        Odds Ratios   CI            p
(Intercept)       4.87          4.39 – 5.43   <0.001
Household.INC_s   2.06          1.76 – 2.43   <0.001
Observations      2798
R2 Tjur           0.036
  • we can add points to our graph like this:
plot(ggeffects::ggpredict(home2, 
                     terms = "Household.INC_s[all]"), add.data = TRUE)

Or to limit our points to the meaningful range:

plot(ggeffects::ggpredict(home2, 
                     terms = "Household.INC_s[all]"), add.data = TRUE) + scale_x_continuous(limits= c(-1.2,4))

  • center (and where it eases interpretation, also scale) your predictor variables.

Uncertainty arises because we only have 17 people in this jittered nz dataset born after 1996:

table(nz$GenCohort)
## 
## Gen Boombers: born >= 1946 & b.< 1961 
##                                  1024 
##                Gen_Silent: born< 1946 
##                                   208 
##          GenX: born >=1961 & b.< 1980 
##                                  1253 
##          GenY: born >=1980 & b.< 1996 
##                                   416 
##                    GenZ: born >= 1996 
##                                    17
ggplot(nz, aes(GenCohort, Household.INC)) + geom_jitter(alpha = .5)

Which model should we prefer?

The model mg2 greatly improves BIC, indicating that we should prefer this model.

per_home <-performance::compare_performance(home2,mg1,mg2)
per_home
## # Comparison of Model Performance Indices
## 
## Name  | Model |      AIC |      BIC | Tjur's R2 |  RMSE | Sigma | Log_loss | Score_log |   PCP | Score_spherical
## ----------------------------------------------------------------------------------------------------------------
## home2 |   glm | 2590.320 | 2602.193 |     0.036 | 0.381 | 0.962 |    0.462 |      -Inf | 0.708 |                
## mg1   |   glm | 2516.922 | 2546.675 |     0.099 | 0.372 | 0.941 |    0.442 |      -Inf | 0.724 |       3.915e-04
## mg2   |   glm | 2272.307 | 2307.927 |     0.168 | 0.354 | 0.900 |    0.404 |      -Inf | 0.748 |       5.159e-04

Here’s a graph:

plot(per_home)

Note that we do not include the model that used fewer cases: from the vantage point of information theory, that would be comparing apples with oranges.
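If we did want an outlier-trimmed model in the comparison, the fix would be to refit every candidate on the same rows, so the information criteria are computed on identical data. A minimal sketch (the *_t names are illustrative):

# refit each model on the trimmed data nz2, then compare like with like
home2_t <- glm(HomeOwner ~ Household.INC_s, data = nz2, family = "binomial")
mg1_t   <- glm(HomeOwner ~ GenCohort, data = nz2, family = "binomial")
mg2_t   <- glm(HomeOwner ~ GenCohort + Household.INC_s, data = nz2, family = "binomial")
performance::compare_performance(home2_t, mg1_t, mg2_t)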

We can use the performance package to check the predictive accuracy of our models.

Compare:

performance_accuracy(mg1)
## # Accuracy of Model Predictions
## 
## Accuracy: 66.58%
##       SE: 2.77%-points
##   Method: Area under Curve

With:

performance_accuracy(mg2)
## # Accuracy of Model Predictions
## 
## Accuracy: 76.94%
##       SE: 2.40%-points
##   Method: Area under Curve

The model with both cohort and income (mg2) predicts home ownership more accurately, consistent with the information-criteria comparison above.

Poisson regression (counts)

We use a Poisson model for rates and counts.
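The simulated data (fake) and the first two fits checked below, pois1 (the correctly specified Poisson) and pois2 (a mis-specified gaussian), would look something like this sketch; the seed, sample size, and coefficients here are illustrative assumptions:

# simulate poisson outcomes with a log link
set.seed(123)
n <- 50
fake <- data.frame(x = runif(n, -2, 2))
fake$y <- rpois(n, exp(1 + 2 * fake$x))

pois1 <- glm(y ~ x, family = poisson, data = fake) # correct family
pois2 <- glm(y ~ x, data = fake)                   # mis-specified gaussian

p_pois1 <- plot(ggeffects::ggpredict(pois1, terms = "x"), add.data = TRUE)
p_pois2 <- plot(ggeffects::ggpredict(pois2, terms = "x"), add.data = TRUE)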

Model checks

performance::check_model(pois1)

We can see the linearity assumption is violated:

performance::check_model(pois2)

library(splines)

pois3 <- glm(y ~ bs(x), data = fake) # note: family = poisson removed, so this is a gaussian spline model
parameters::model_parameters(pois3)
## Parameter      | Coefficient |   SE |           95% CI |  t(46) |      p
## ------------------------------------------------------------------------
## (Intercept)    |       -6.57 | 2.13 | [-10.74,  -2.39] |  -3.08 | 0.003 
## x [1st degree] |       43.71 | 7.41 | [ 29.19,  58.23] |   5.90 | < .001
## x [2nd degree] |      -67.11 | 5.30 | [-77.50, -56.72] | -12.66 | < .001
## x [3rd degree] |      118.01 | 4.46 | [109.26, 126.76] |  26.44 | < .001
p_pois3<- plot(ggeffects::ggpredict(pois3, terms = "x"), add.data = TRUE)
p_pois3

library(patchwork)
p_pois1 + p_pois2 + p_pois3 + plot_annotation(title = "Comparison of three assumed distributions", tag_levels = 'a',
                                              subtitle = "The Poisson model (a) fits \nThe gaussian model (b) underfits \nThe spline model (c) overfits") +
  plot_layout(guides = "collect")

The Poisson model fits best. This is no surprise: we simulated Poisson outcomes.

per_pois <-performance::compare_performance(pois1,pois2,pois3)
per_pois
## # Comparison of Model Performance Indices
## 
## Name  | Model |     AIC |     BIC |   RMSE |  Sigma |    R2 | Nagelkerke's R2 | Score_log | Score_spherical
## -----------------------------------------------------------------------------------------------------------
## pois1 |   glm | 188.481 | 192.305 |  3.361 |  0.985 |       |           1.000 |    -1.845 |           0.097
## pois2 |   glm | 426.698 | 432.434 | 16.249 | 16.584 | 0.521 |                 |           |                
## pois3 |   glm | 300.273 | 309.833 |  4.410 |  4.597 | 0.965 |                 |           |
plot(per_pois)

Negative binomial models (over-dispersed Poissons)
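Count data whose variance exceeds the mean (“overdispersion”) violate the Poisson assumption that the variance equals the mean; the negative binomial adds a dispersion parameter to absorb this. A minimal sketch using the volunteering variables prepared below (the model names are illustrative):

# fit a poisson, test for overdispersion, then fit a negative binomial
pois_nz <- glm(HoursCharity ~ Relid_s + Household.INC_s,
               family = poisson, data = nz)
performance::check_overdispersion(pois_nz)

nb_nz <- MASS::glm.nb(HoursCharity ~ Relid_s + Household.INC_s, data = nz)
parameters::model_parameters(nb_nz)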

Zero-inflated Poisson / negative binomial regression

Model syntax

library(brms)

# zero-inflated models require an integer outcome
nz$HoursCharity <- as.integer(nz$HoursCharity)

# standardise household income
nz['Household.INC_s'] <- as.data.frame(scale(nz$Household.INC))

# standardise religious identification
nz['Relid_s'] <- as.data.frame(scale(nz$Relid))

b0<- brms::brm(HoursCharity ~ Relid_s + Household.INC_s, 
                family = "zero_inflated_poisson",
                file = here::here("models", "zeroinflated_poisson_volunteer"), 
               data = nz)
summary(b0)
##  Family: zero_inflated_poisson 
##   Links: mu = log; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept           1.70      0.02     1.67     1.73 1.00
## Relid_s             0.00      0.01    -0.02     0.03 1.00
## Household.INC_s    -0.09      0.02    -0.12    -0.05 1.00
##                 Bulk_ESS Tail_ESS
## Intercept           3785     3068
## Relid_s             4501     3362
## Household.INC_s     3704     3054
## 
## Family Specific Parameters: 
##    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## zi     0.70      0.01     0.68     0.72 1.00     3849
##    Tail_ESS
## zi     2989
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

My preferred way of graphing the predicted effects of religious identification on volunteering is to generate prediction lines from the posterior samples. This captures uncertainty in an intuitive way. I’ll explain the mechanics when we get to Bayesian estimation, which, by the way, we are already doing here.

plot(
  conditional_effects(
    b0,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
) # points = TRUE and point_args overlay the raw data as transparent, jittered points

b1 <- brms::brm(HoursCharity ~ Relid_s + Household.INC_s, 
                family = "zero_inflated_negbinomial",
                file = here::here("models", "zeroinflated_neg_bin_volunteer"),
                data = nz)
summary(b1)
##  Family: zero_inflated_negbinomial 
##   Links: mu = log; shape = identity; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept           0.89      0.17     0.52     1.17 1.00
## Relid_s             0.21      0.06     0.11     0.33 1.00
## Household.INC_s    -0.07      0.05    -0.16     0.03 1.00
##                 Bulk_ESS Tail_ESS
## Intercept            949      885
## Relid_s             1456     1698
## Household.INC_s     2061     2040
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## shape     0.28      0.07     0.16     0.42 1.00      938
## zi        0.35      0.11     0.07     0.50 1.00      939
##       Tail_ESS
## shape      914
## zi         697
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Compare the fits of the two models using leave-one-out cross-validation.

Note that on this index, negative numbers are worse (sorry to confuse…)

b0 <- add_criterion(b0, "loo")
b1 <- add_criterion(b1, "loo")

w <-loo_compare(b0, b1, criterion = "loo")
w
##    elpd_diff se_diff
## b1     0.0       0.0
## b0 -1514.8     207.5

Recall that we had been using an AIC/BIC convention to estimate improvements in goodness of fit. We can generate an analogous index as follows:

cbind(waic_diff = w[, 1] * -2,
      se        = w[, 2] * 2)
##    waic_diff       se
## b1     0.000   0.0000
## b0  3029.657 414.9648

And we can see that the negative binomial model fits much better.

Predicting zeros

The estimates in the graphs above are only for the positive (non-zero) component of the model. Let’s look at the results again:

summary(b1)
##  Family: zero_inflated_negbinomial 
##   Links: mu = log; shape = identity; zi = identity 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Rhat
## Intercept           0.89      0.17     0.52     1.17 1.00
## Relid_s             0.21      0.06     0.11     0.33 1.00
## Household.INC_s    -0.07      0.05    -0.16     0.03 1.00
##                 Bulk_ESS Tail_ESS
## Intercept            949      885
## Relid_s             1456     1698
## Household.INC_s     2061     2040
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## shape     0.28      0.07     0.16     0.42 1.00      938
## zi        0.35      0.11     0.07     0.50 1.00      939
##       Tail_ESS
## shape      914
## zi         697
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Note that in the preferred model the zero-inflation parameter zi is estimated on the identity scale (see the “Links” line of the summary), so for people at the mean of Religious Identification and mean Household income, the probability of belonging to the structural non-volunteering class is 0.35 directly, no plogis() required. Adding the zeros that the negative binomial component itself generates, more often than not we should predict zeros in this population. What predicts the zero component of the model? We can use this syntax:

b2 <- brms::brm(
  bf(HoursCharity ~ Relid_s + Household.INC_s, # note: use `bf()` when the model has more than one formula, as here
     zi ~ Relid_s + Household.INC_s),
  family = "zero_inflated_negbinomial",
  file = here::here("models", "zeroinflated_nb_2_volunteer"),
  data = nz)

Let’s look at the results:

sjPlot::tab_model(b2)

                      HoursCharity
Predictors            Incidence Rate Ratios   CI (95%)
Count Model
  Intercept           3.40                    2.85 – 3.95
  Relid_s             1.02                    0.94 – 1.11
  Household.INC_s     0.93                    0.84 – 1.03
Zero-Inflated Model
  Intercept           1.06                    0.72 – 1.38
  Relid_s             0.52                    0.42 – 0.61
  Household.INC_s     1.02                    0.87 – 1.17
Observations          2796
R2 Bayes              0.015

Interpretation

We can graph the results using ggeffects:

Religious identification:

plot(ggeffects::ggpredict(b2, terms = c("Relid_s")), 
    add.data = TRUE,  # doesn't work
     dot.alpha = .2,  
     facet = TRUE)  + ylim(0, 5)

Household income:

plot(ggeffects::ggpredict(b2, terms = c("Household.INC_s")), 
    add.data = TRUE,  # doesn't work
     dot.alpha = .2,  
     facet = TRUE)  + ylim(0, 5) +  xlim(0, 5)

Or, using my preferred method:

Predicted effects of religious identification:

plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
 point_args = list(alpha = 0.05,
                    width = .02),
 ask = FALSE
) + # to modify a single facet, select it with [[...]] first (see Appendix 1)
  ylim(0, 5)

## NULL

Note:

Models of class brmsfit always condition on the zero-inflation component, if the model has such a component. Hence, there is no type = “zero_inflated” nor type = “zi_random” for brmsfit-models, because predictions are based on draws of the posterior distribution, which already account for the zero-inflation part of the model.

See the package description.
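If you do want the zero-inflation probabilities on their own, recent versions of brms can return posterior draws for the zi distributional parameter; a sketch:

# posterior draws of the zero-inflation probability for each observation
zi_draws <- brms::posterior_epred(b2, dpar = "zi")
dim(zi_draws)             # posterior draws x observations
mean(colMeans(zi_draws))  # average probability of a structural zero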

Comparing models, we find that adding the predictors improves the model:

b2 <- add_criterion(b2, "loo")

w <-loo_compare(b0, b1, b2,  criterion = "loo")
w
##    elpd_diff se_diff
## b2     0.0       0.0
## b1   -50.0      10.1
## b0 -1564.8     206.1

Again we can use a -2 × log-likelihood analogue:


cbind(waic_diff = w[, 1] * -2,
      se        = w[, 2] * 2)
##    waic_diff        se
## b2    0.0000   0.00000
## b1  100.0416  20.14189
## b0 3129.6983 412.13260

Another way to see this is to look at the posterior predictive checks for the three models.

Zero-inflated Poisson

brms::pp_check(b0) + xlim(0, 5)

Zero-inflated negative binomial with no predictors for the zero-inflation part.

brms::pp_check(b1) + xlim(0, 5)

Zero-inflated negative binomial + predictors for the zero-inflation part

brms::pp_check(b2) + xlim(0, 5)

summary(b2)
##  Family: zero_inflated_negbinomial 
##   Links: mu = log; shape = identity; zi = logit 
## Formula: HoursCharity ~ Relid_s + Household.INC_s 
##          zi ~ Relid_s + Household.INC_s
##    Data: nz (Number of observations: 2796) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI
## Intercept              1.22      0.08     1.05     1.37
## zi_Intercept           0.05      0.16    -0.33     0.32
## Relid_s                0.02      0.04    -0.06     0.10
## Household.INC_s       -0.07      0.05    -0.17     0.03
## zi_Relid_s            -0.66      0.09    -0.88    -0.50
## zi_Household.INC_s     0.02      0.07    -0.14     0.16
##                    Rhat Bulk_ESS Tail_ESS
## Intercept          1.00     1657     2031
## zi_Intercept       1.00     1466     1395
## Relid_s            1.00     3142     2943
## Household.INC_s    1.00     3404     2974
## zi_Relid_s         1.00     2096     1704
## zi_Household.INC_s 1.00     2964     2613
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## shape     0.45      0.06     0.33     0.58 1.00     1510
##       Tail_ESS
## shape     1500
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Reporting

You can start with the report package to generate reports. However, be careful about boilerplate reporting: it is up to you, as an applied scientist, to interpret your results.

For example, the report package describes the \(R^2\) statistic. For reasons we can discuss in lab, this statistic is problematic, especially for observational studies.

A better option for expressing the practical interest of a result is to describe the expected values of an outcome across a range of predictors. We have been doing this all along with our graphical models. However, you might want to report the numbers for these expected values as well. For this purpose, the ggeffects package can be handy. You merely leave out the plot component of your syntax:

ggeffects::ggpredict(b2, terms = "Relid_s")
## # Predicted counts of HoursCharity
## # x = Relid_s
## 
##     x | Predicted |       95% CI
## --------------------------------
## -0.64 |      1.29 | [1.14, 1.46]
## -0.25 |      1.51 | [1.36, 1.68]
##  0.15 |      1.74 | [1.58, 1.92]
##  0.54 |      1.97 | [1.78, 2.18]
##  0.94 |      2.19 | [1.96, 2.44]
##  1.33 |      2.40 | [2.13, 2.72]
##  1.73 |      2.60 | [2.26, 2.99]
##  2.12 |      2.78 | [2.36, 3.27]
## 
## Adjusted for:
## * Household.INC_s = 0.01

Here, we have generated the predicted magnitudes of volunteering across the full range of the predictor (Religious Identification, measured in standard deviation units). The expected volunteering rate of a secular person, when other predictors are set to zero, is 1.29 hours (95% CI [1.14, 1.46]); for a fully religiously identified person it is 2.78 hours (95% CI [2.36, 3.27]).

We might want to focus on different elements of the model. Among those who volunteer, the expected hours for a secular person (other predictors set to zero) is exp(1.22) = 3.39; for a fully religiously identified person (2.12 SD above the mean) it is exp(1.22 + 0.02 * 2.12) = 3.53. The expected zero-rate among the secular population is plogis(0.05) = 0.51; for a fully identified religious person it is plogis(0.05 - 0.66 * 2.12) = 0.21. This suggests that religious people are much less likely to be in the structural-zero (non-volunteering) class. However, among those who volunteer, religious identification predicts little difference in the number of hours volunteered (in the baseline population).
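These back-of-the-envelope figures can be reproduced directly from the posterior means reported by summary(b2):

# reproducing the arithmetic above from the posterior means of b2
exp(1.22)                  # expected hours among secular volunteers
exp(1.22 + 0.02 * 2.12)    # expected hours among fully identified volunteers
plogis(0.05)               # structural-zero probability, secular
plogis(0.05 - 0.66 * 2.12) # structural-zero probability, fully identified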

Is this interesting? Rather than averaging across both the zero-inflated and volunteering populations, it might be more informative to separately describe behaviour among each of these distinct populations. Remember, as applied scientists, we are not merely trying to be explicit about our assumptions and decisions. We’re trying to improve our beliefs relating to some concrete question. Our reporting must reflect our scientific interests. It must clarify what we’ve learned about the world, as well as how we remain uncertain.

Summary

Today we introduced the generalised linear model. We estimated binary responses with logistic regression, counts with Poisson regression, overdispersed counts with negative binomial models, and excess zeros with zero-inflated models; and we practised interpreting, comparing, and reporting these models.

Appendix 1

If you want to graph each predictor separately, employ the [[1]] or [[2]] syntax as follows:

Predicted effects of religious identification

b2_p1 <- plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  ask = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
)[[1]]  + # note this command controls which facet
  ylim(0, 5) + labs(title = "Better title",
                                 subtitle = "better subtitle") +
  xlab("Religious Identification (SD units)") +
  ylab("Hours volunteering in the previous week")

Predicted effects of income:
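A matching panel for income, b2_p2, can be built the same way by selecting the second facet with [[2]]; a sketch (the axis labels are placeholders):

b2_p2 <- plot(
  conditional_effects(
    b2,
    spaghetti = TRUE,
    nsamples = 100,
    select_points = 0.1
  ),
  points = TRUE,
  ask = TRUE,
  point_args = list(alpha = 0.1,
                    width = .02)
)[[2]] + # [[2]] selects the household income facet
  ylim(0, 5) +
  xlab("Household income (SD units)") +
  ylab("Hours volunteering in the previous week")

Then combine the two panels with patchwork: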

library("patchwork")
b2_p1 + b2_p2

Appendix 2

We have covered lots of ground today. The generalised linear models described here are the most commonplace; however, there are many more. For those who are curious about how to estimate zero-inflated binary data, I recommend Solomon Kurz’s work here.

Appendix 3

Aside: in Bayesian estimation we often estimate the standard deviation \(\sigma\); outside these circles, people describe the same parameter as the variance, or \(\sigma^2\).

It doesn’t really matter: the parameters don’t care how we express them. We can say “Johannes is 2 meters tall,” or we can say “Johannes is the square root of 4 meters tall,” and his height will remain indifferent to our convention. So too, the world is indifferent to whether we write \(\sigma\) or \(\sigma^2\).

Acknowledgments and References

@gelman2020

@kurz2020

@bürkner2019



  1. It is worth cultivating this explicit habit because we will need it when clarifying our beliefs about the world prior to modelling it (as we do in Bayesian modelling).↩︎